# Create the sequences to be filtered against

Create a directory to store the filters in:
```
mkdir /osc-fs_home/mdehoon/Data/CASPARs/Filters
```

## 1.1 Mitochondrial DNA

```
python make_target_sequences.py chrM
```

This will generate the file `chrM.fa` with the forward and the reverse sequence of chrM, and a trivial `chrM.psl` file with the position of the forward and reverse sequence with respect to the chrM chromosome. Store these files:
```
mv chrM.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv chrM.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
```
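
As an illustration of what "forward and reverse sequence" can mean, here is a minimal sketch that writes both strands of a sequence as FASTA records. It assumes that the reverse sequence is the reverse complement. The record names (`chrM_forward`, `chrM_reverse`) and the helper functions are hypothetical, and `make_target_sequences.py` may do this differently.

```python
# Hypothetical helpers; not taken from make_target_sequences.py.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]

def forward_and_reverse_fasta(name, sequence):
    """Return FASTA text holding the forward and the reverse-complement strand."""
    records = [
        (f"{name}_forward", sequence),
        (f"{name}_reverse", reverse_complement(sequence)),
    ]
    return "".join(f">{title}\n{seq}\n" for title, seq in records)

# Toy 10-nt sequence standing in for the full chromosome.
print(forward_and_reverse_fasta("chrM", "GATCACAGGT"))
```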

## 1.2 Ribosomal RNA

```
python make_target_sequences.py rRNA
```
This will generate the file `rRNA.fa` with 37 ribosomal sequences,
and `rRNA.psl` with the genomic locations of 29 of them.
Collect the transcript lengths:
```
faSize -detailed rRNA.fa > rRNA.chrom.sizes
```
Store these files:
```
mv rRNA.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv rRNA.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv rRNA.chrom.sizes /osc-fs_home/mdehoon/Data/CASPARs/Filters
```
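
For reference, the `.chrom.sizes` file is a two-column table of sequence names and lengths. The sketch below computes the same kind of table in Python from FASTA lines; it assumes the sequence name is the first word of each header line, and the function name `fasta_sizes` is made up for this example.

```python
def fasta_sizes(lines):
    """Yield (name, length) for each record in an iterable of FASTA lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Assumption: the name is the first word of the header line.
            name, length = line[1:].split()[0], 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length

# Toy records; real use would pass an open file such as rRNA.fa.
example = [">seq1 first toy record", "ACGTACGT", "ACG", ">seq2", "GGGCCC"]
for name, length in fasta_sizes(example):
    print(f"{name}\t{length}")
```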

The genes coding for 18S, 28S and 5.8S rRNA are transcribed by RNA polymerase I; the genes coding for 5S rRNA are transcribed by RNA polymerase III.

## 1.3 Transfer RNA

Download the RepeatMasker annotations for hg38 from the UCSC Table Browser and store them as `hg38.rmsk.txt.gz` in the directory `/osc-fs_home/mdehoon/Data/RepeatMasker/hg38`. Then run
```
python script_tRNA_RepeatMasker.py
```
This generates the file `tRNA.gff`. Store this file in the same directory `/osc-fs_home/mdehoon/Data/RepeatMasker/hg38`.

Download the tRNA genes identified by tRNAscan-SE in hg38 from the UCSC Table Browser and store them as `tRNA.txt.gz` in the directory `/osc-fs_home/mdehoon/Data/UCSC/hg38`. Then run
```
python script_tRNA_UCSC.py
```
This generates the file `tRNA.gff`. Store this file in the same directory `/osc-fs_home/mdehoon/Data/UCSC/hg38`.

Download the Entrez Gene database (file `Homo_sapiens.ags.gz`) from NCBI, and run
```
python script_tRNA_NCBI.py
```
This generates the file `trnas.bed`. Store the database file `Homo_sapiens.ags.gz` and the generated file `trnas.bed` in `/osc-fs_home/mdehoon/Data/NCBI/hg38`.

Merge the tRNAs from RepeatMasker, UCSC, and NCBI into a single `.bed` file:
```
python mergetrna.py hg38
```
generating the file `mergedtrnas.bed`. Note that `mergetrna.py` uses the FANTOM5 short RNA data to decide the exact genomic extent of each tRNA. Rename and store the generated file:
```
mv mergedtrnas.bed /osc-fs_home/mdehoon/LSA/ShortRNAPipeline/Annotation/hg38/tRNA.bed
```
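
The core idea of combining annotations from several sources can be sketched as merging overlapping intervals on the same chromosome and strand. This is only an illustration: the function below is hypothetical, it does not use the FANTOM5 short RNA data that `mergetrna.py` relies on to fix the exact extent of each tRNA, and the coordinates are invented.

```python
def merge_intervals(records):
    """Merge overlapping or adjacent (chrom, start, end, strand) intervals.

    Coordinates are assumed to be zero-based and half-open, as in BED files.
    """
    merged = []
    ordered = sorted(records, key=lambda r: (r[0], r[3], r[1], r[2]))
    for chrom, start, end, strand in ordered:
        if merged:
            last = merged[-1]
            if last[0] == chrom and last[3] == strand and start <= last[2]:
                # Extend the previous interval instead of adding a new one.
                merged[-1] = (chrom, last[1], max(last[2], end), strand)
                continue
        merged.append((chrom, start, end, strand))
    return merged

# Invented coordinates standing in for RepeatMasker, UCSC, and NCBI entries.
annotations = [
    ("chr1", 100, 172, "+"),
    ("chr1", 98, 170, "+"),
    ("chr1", 300, 372, "-"),
]
print(merge_intervals(annotations))
```
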
Create a file with the tRNA sequences:
```
python make_target_sequences.py tRNA
```
This will generate the file `tRNA.fa` with 2089 transfer RNA sequences,
and `tRNA.psl` with the genomic locations of 1854 of them. The genomic locations for 235 transfer RNAs mapped to `chr*_alt` chromosomes were dropped.
Create a corresponding `.bed` file:
```
pslToBed tRNA.psl tRNA.bed
```
This `.bed` file will be used when annotating the generated `.bam` files with overlapping pre-tRNAs.
Store these files:
```
mv tRNA.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv tRNA.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv tRNA.bed /osc-fs_home/mdehoon/Data/CASPARs/Filters
```
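
Dropping the alignments to `chr*_alt` chromosomes amounts to filtering PSL rows on the target name, which is the 14th of the 21 tab-separated PSL columns. The sketch below shows one way to do that; the function names and the example rows are hypothetical and are not taken from `make_target_sequences.py`.

```python
from fnmatch import fnmatchcase

T_NAME = 13  # zero-based index of the tName column in a PSL row

def split_alt_alignments(psl_lines, pattern="chr*_alt"):
    """Return (kept, dropped) PSL rows; rows whose target matches pattern are dropped."""
    kept, dropped = [], []
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 21 or not fields[0].isdigit():
            continue  # skip the optional PSL header and blank lines
        if fnmatchcase(fields[T_NAME], pattern):
            dropped.append(line)
        else:
            kept.append(line)
    return kept, dropped

def example_row(query, target):
    """Build an invented 72-nt alignment; only qName and tName matter here."""
    fields = ["72", "0", "0", "0", "0", "0", "0", "0", "+", query, "72", "0", "72",
              target, "1000", "100", "172", "1", "72,", "0,", "100,"]
    return "\t".join(fields)

rows = [example_row("tRNA_1", "chr1"), example_row("tRNA_2", "chr1_KI270762v1_alt")]
kept, dropped = split_alt_alignments(rows)
print(len(kept), "kept;", len(dropped), "dropped")
```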

Genes coding for tRNAs are transcribed by RNA polymerase III (PMID: 17977614).

## 1.4 Small nuclear RNAs (spliceosomal RNAs)

```
python make_target_sequences.py snRNA
```
This generates a file `snRNA.fa` with the 36 small nuclear RNA sequences in RefSeq, and `snRNA.psl` with the genomic locations for 35 snRNAs. Create a `.bed` file with the snRNA locations:
```
pslToBed snRNA.psl snRNA.bed
```
Add snRNA annotations from Ensembl:
```
python add_ensembl_snrnas.py
```
This writes new versions of `snRNA.fa`, with 2073 snRNA sequences, and `snRNA.psl`, with the genomic locations of 1910 snRNAs.

Regenerate the `.bed` file:
```
pslToBed snRNA.psl snRNA.bed
```
This `.bed` file will be used when annotating the generated `.bam` files with overlapping pre-snRNAs.
Store these files:
```
mv snRNA.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv snRNA.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv snRNA.bed /osc-fs_home/mdehoon/Data/CASPARs/Filters
```

Spliceosomal RNAs are produced by RNA polymerase II (PMID: 15564372), except for U6 snRNA, which is produced by RNA polymerase III (PMID: 17977614). The 7SK small nuclear RNA, also included here, is produced by RNA polymerase III (PMID: 911771 and 17977614).

## 1.5 Small cytoplasmic RNAs (7SL RNAs, Brain cytoplasmic RNA 1, and MALAT1-associated small cytoplasmic RNA)

```
python make_target_sequences.py scRNA
```
This generates a file `scRNA.fa` with the 6 small cytoplasmic RNA sequences in RefSeq, and `scRNA.psl` with their genomic locations.
Store these files:
```
mv scRNA.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv scRNA.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
```

The 7SL Signal Recognition Particle RNAs and the BC200 brain cytoplasmic RNA are produced by RNA polymerase III. MASCRNA, in contrast, is cleaved by RNase P and RNase Z from the MALAT1 transcript, which is produced by RNA polymerase II:

| Accession   | Name    | Description | PubMed ID |
| ----------- | ------- | ----------- | --------- |
| NR_144569.1 | MASCRNA | Cleaved by RNases P and Z from MALAT1 | 19041754 |
| NR_002715.1 | RN7SL1  | Signal Recognition Particle RNA | 17977614 |
| NR_145670.1 | RN7SL3  | Signal Recognition Particle RNA | 17977614 |
| NR_027260.1 | RN7SL2  | Signal Recognition Particle RNA | 17977614 |
| NR_144555.1 | RN7SL832P | Signal Recognition Particle RNA pseudogene | 17977614 |
| NR_001568.1 | BCYRN1 | BC200 brain cytoplasmic RNA | 28761139 |

## 1.6 Small nucleolar RNAs

```
python make_target_sequences.py snoRNA
```
This generates a file `snoRNA.fa` with the 541 small nucleolar RNA sequences in RefSeq, and `snoRNA.psl` with their genomic locations.
Create a `.bed` file with the snoRNA locations:
```
pslToBed snoRNA.psl snoRNA.bed
```
Add snoRNA annotations from Ensembl:
```
python add_ensembl_snornas.py
```
This writes new versions of `snoRNA.fa`, with 1034 snoRNA sequences, and `snoRNA.psl`, with their genomic locations. The script also uses the Rfam (PMID: 33211869) annotations from RNAcentral (PMID: 33106848) and the snoRNA annotations from snoDB (PMID: 31598696).

Regenerate the `.bed` file:
```
pslToBed snoRNA.psl snoRNA.bed
```
This `.bed` file will be used when annotating the generated `.bam` files with overlapping pre-snoRNAs.
Store these files:
```
mv snoRNA.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv snoRNA.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv snoRNA.bed /osc-fs_home/mdehoon/Data/CASPARs/Filters
```

Small nucleolar RNAs U3, U8, and U13 are transcribed by RNA polymerase II; their 5' cap is converted to a 2,2,7-trimethylguanosine cap by methylation. Other small nucleolar RNAs are excised from protein-coding transcripts (produced by RNA polymerase II), and have a 5' phosphate (PMID: 7797080).

## 1.7 Ro-associated RNAs Y1/Y3/Y4/Y5

```
python make_target_sequences.py yRNA
```
This generates a file `yRNA.fa` with the 4 Ro-associated RNA sequences in RefSeq, and `yRNA.psl` with their genomic locations.
Create a `.bed` file with the yRNA locations:
```
pslToBed yRNA.psl yRNA.bed
```

Store these files:
```
mv yRNA.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv yRNA.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv yRNA.bed /osc-fs_home/mdehoon/Data/CASPARs/Filters
```

Ro-associated RNAs are transcribed by RNA polymerase III (PMID: 6187471).

## 1.8 Histone genes

```
python make_target_sequences.py histone
```
This generates a file `histone.fa` with the 136 histone transcripts in RefSeq, and `histone.psl` with their genomic locations.
Create a `.bed` file with the histone transcript locations:
```
pslToBed histone.psl histone.bed
```

Store these files:
```
mv histone.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv histone.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv histone.bed /osc-fs_home/mdehoon/Data/CASPARs/Filters
```

## 1.9 RNA component of mitochondrial RNA processing endoribonuclease

```
python make_target_sequences.py RMRP
```
This generates a file `RMRP.fa` with the RMRP sequence in RefSeq, and `RMRP.psl` with its genomic location.
Create a `.bed` file with the RMRP location:
```
pslToBed RMRP.psl RMRP.bed
```

Store these files:
```
mv RMRP.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv RMRP.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv RMRP.bed /osc-fs_home/mdehoon/Data/CASPARs/Filters
```

RMRP is transcribed by RNA polymerase III (PMID: 17977614).

## 1.10 Small Cajal body-specific RNAs

```
python make_target_sequences.py scaRNA
```
This generates a file `scaRNA.fa` with the 29 small Cajal body-specific RNA sequences in RefSeq, and `scaRNA.psl` with their genomic locations.
Create a corresponding `.bed` file:
```
pslToBed scaRNA.psl scaRNA.bed
```
This `.bed` file will be used when annotating the generated `.bam` files with overlapping pre-scaRNAs.
Store these files:
```
mv scaRNA.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv scaRNA.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv scaRNA.bed /osc-fs_home/mdehoon/Data/CASPARs/Filters
```

Small Cajal body-specific RNAs are processed from introns of pre-mRNA transcripts, except for scaRNA2 and scaRNA17, which are transcribed independently by RNA polymerase II (PMID: 19906720).

## 1.11 RNA component of the RNase P ribonucleoprotein

```
python make_target_sequences.py RPPH
```
This generates a file `RPPH.fa` with the RPPH sequence in RefSeq, and `RPPH.psl` with its genomic location.

Store these files:
```
mv RPPH.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv RPPH.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
```

RPPH is transcribed by RNA polymerase III (PMID: 1871114).

## 1.12 Small ILF3/NF90-associated RNAs

```
python make_target_sequences.py snar
```
This generates a file `snar.fa` with the 28 small ILF3/NF90-associated RNA sequences in RefSeq, and `snar.psl` with their genomic locations.

Store these files:
```
mv snar.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv snar.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
```

Small ILF3/NF90-associated RNAs are produced by RNA polymerase III (PMID: 17855395).

## 1.13 Telomerase RNA component

```
python make_target_sequences.py TERC
```
This generates a file `TERC.fa` with the TERC transcript sequence in RefSeq, and `TERC.psl` with its genomic location.

Store these files:
```
mv TERC.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv TERC.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
```

TERC is produced by RNA polymerase II (PMID: 7544491, 12769858).

## 1.14 Vault RNAs

```
python make_target_sequences.py vRNA
```
This generates a file `vRNA.fa` with the 4 vault RNA sequences in RefSeq, and `vRNA.psl` with their genomic locations.

Store these files:
```
mv vRNA.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv vRNA.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
```

Vault RNAs are produced by RNA polymerase III (PMID: 17977614).

## 1.15 Metastasis associated lung adenocarcinoma transcript 1

```
python make_target_sequences.py MALAT1
```
This generates a file `MALAT1.fa` with the 3 MALAT1 transcript sequences in RefSeq, and `MALAT1.psl` with their genomic locations.
Store these files:
```
mv MALAT1.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv MALAT1.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
```

## 1.16 Small nucleolar RNA host genes

```
python make_target_sequences.py snhg
```
This generates a file `snhg.fa` with the 124 small nucleolar RNA host gene transcripts in RefSeq, and `snhg.psl` with their genomic locations.
Store these files:
```
mv snhg.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv snhg.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
```

## 1.17 Messenger RNAs

```
python make_target_sequences.py mRNA
```
The script compares each RefSeq sequence to the genomic sequence inferred from its genomic location. If the two are inconsistent but within 1% of each other, the RefSeq sequence is replaced by the genomic sequence. If the discrepancy is larger, the RefSeq sequence is kept for filtering, but any reads aligning to it are not mapped to the genome.
This generates a file `mRNA.fa` with the 115876 mRNA sequences in RefSeq, and `mRNA.psl` with the genomic locations for 115397 mRNAs.
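
The decision rule for inconsistent sequences can be sketched as follows. How the script actually measures the discrepancy is not stated here, so this example assumes a simple measure: position-wise mismatches plus the length difference, divided by the longer length. The function name and the `threshold` parameter are hypothetical.

```python
def choose_sequence(refseq, genomic, threshold=0.01):
    """Return (sequence to store, whether aligned reads are mapped to the genome).

    Assumption: discrepancy = (mismatches + length difference) / longer length.
    """
    if refseq == genomic:
        return refseq, True
    differences = sum(a != b for a, b in zip(refseq, genomic))
    differences += abs(len(refseq) - len(genomic))
    fraction = differences / max(len(refseq), len(genomic))
    if fraction <= threshold:
        # Small discrepancy: use the genomic sequence instead of RefSeq.
        return genomic, True
    # Large discrepancy: keep RefSeq for filtering, but do not map to the genome.
    return refseq, False

# One mismatch in 200 nucleotides is a 0.5% discrepancy.
sequence, mappable = choose_sequence("A" * 200, "A" * 199 + "C")
print(sequence.endswith("C"), mappable)
```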

Store these files:
```
mv mRNA.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv mRNA.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
```
Create a GFF file with an annotation of the gene associated with each mRNA:
```
python make_gene_annotation_gff.py mRNA
```
generating the file `mRNA.gff`. Each line in this file has an attribute `transcript` with the RefSeq accession number, and an attribute `gene` with the gene symbol.
Store this file:
```
mv mRNA.gff /osc-fs_home/mdehoon/Data/CASPARs/Filters
```
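
The `transcript` and `gene` attributes make it straightforward to look up the gene for a given transcript. The sketch below builds such a lookup table; it assumes GFF3-style `key=value` attributes separated by semicolons in the ninth column, which may not match the exact syntax written by `make_gene_annotation_gff.py`. The accession and gene symbol in the example are placeholders.

```python
def transcript_to_gene(gff_lines):
    """Map each transcript accession to its gene symbol.

    Assumption: column 9 holds GFF3-style attributes, e.g. "transcript=...;gene=...".
    """
    mapping = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        attributes = dict(
            item.strip().split("=", 1)
            for item in fields[8].split(";")
            if "=" in item
        )
        mapping[attributes["transcript"]] = attributes["gene"]
    return mapping

# Placeholder record; real use would pass an open file such as mRNA.gff.
example = ["chr1\tRefSeq\texon\t100\t200\t.\t+\t.\ttranscript=NM_EXAMPLE.1;gene=GENE1"]
print(transcript_to_gene(example))
```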

Create an index for BWA for the messenger RNA sequences:
```
bwa index /osc-fs_home/mdehoon/Data/CASPARs/Filters/mRNA.fa
faSize -detailed /osc-fs_home/mdehoon/Data/CASPARs/Filters/mRNA.fa > /osc-fs_home/mdehoon/Data/CASPARs/Filters/mRNA.chrom.sizes
```
Draw the distribution of mRNA transcript lengths:
```
python -i make_figure_size_distribution.py mRNA
```
which also reports that mature mRNAs have a mean transcript length of 4115 nucleotides, a median transcript length of 3372 nucleotides, with 95.55% of mRNA transcripts longer than 1,000 nucleotides.
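
The reported summary statistics can be reproduced from a list of transcript lengths, for example the second column of `mRNA.chrom.sizes`. This is a minimal sketch with a hypothetical function name and toy lengths; `make_figure_size_distribution.py` may compute them differently.

```python
import statistics

def summarize_lengths(lengths, cutoff=1000):
    """Return (mean, median, percentage of lengths above cutoff)."""
    lengths = list(lengths)
    mean = statistics.mean(lengths)
    median = statistics.median(lengths)
    percent_longer = 100.0 * sum(n > cutoff for n in lengths) / len(lengths)
    return mean, median, percent_longer

# Toy lengths, not real transcript data.
mean, median, percent_longer = summarize_lengths([500, 1500, 2500, 3500])
print(f"mean {mean:.0f}, median {median:.0f}, {percent_longer:.2f}% longer than 1,000")
```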

## 1.18 Long non-coding RNAs

```
python make_target_sequences.py lncRNA
```
The script compares each RefSeq sequence to the genomic sequence inferred from its genomic location. If the two are inconsistent but within 1% of each other, the RefSeq sequence is replaced by the genomic sequence. If the discrepancy is larger, the RefSeq sequence is kept for filtering, but any reads aligning to it are not mapped to the genome.
This generates a file `lncRNA.fa` with the 45299 long non-coding RNA sequences in RefSeq, and `lncRNA.psl` with the genomic locations for 43674 lncRNAs.

Store these files:
```
mv lncRNA.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv lncRNA.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
```
Create a GFF file with an annotation of the gene associated with each lncRNA:
```
python make_gene_annotation_gff.py lncRNA
```
generating the file `lncRNA.gff`. Each line in this file has an attribute `transcript` with the RefSeq accession number, and an attribute `gene` with the gene symbol.
Store this file:
```
mv lncRNA.gff /osc-fs_home/mdehoon/Data/CASPARs/Filters
```

Create an index for BWA for the long non-coding RNA sequences:
```
bwa index /osc-fs_home/mdehoon/Data/CASPARs/Filters/lncRNA.fa
faSize -detailed /osc-fs_home/mdehoon/Data/CASPARs/Filters/lncRNA.fa > /osc-fs_home/mdehoon/Data/CASPARs/Filters/lncRNA.chrom.sizes
```
Draw the distribution of lncRNA transcript lengths:
```
python -i make_figure_size_distribution.py lncRNA
```
which also reports that mature lncRNAs have a mean transcript length of 2611 nucleotides, a median transcript length of 1989 nucleotides, with 74.76% of lncRNA transcripts longer than 1,000 nucleotides.

## 1.19 Gencode transcripts

```
python make_target_sequences.py gencode
```
This generates a file `gencode.fa` with the 228048 transcript sequences in Gencode, and `gencode.psl` with their genomic locations.

Store these files:
```
mv gencode.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv gencode.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
```
Create an index for BWA for the Gencode transcript sequences:
```
bwa index /osc-fs_home/mdehoon/Data/CASPARs/Filters/gencode.fa
faSize -detailed /osc-fs_home/mdehoon/Data/CASPARs/Filters/gencode.fa > /osc-fs_home/mdehoon/Data/CASPARs/Filters/gencode.chrom.sizes
```

## 1.20 FANTOM-CAT

```
python make_target_sequences.py fantomcat
```
The script will use the genomic location of each FANTOM-CAT transcript to generate its transcript sequence from the genome sequence.
This generates a file `fantomcat.fa` with 709176 FANTOM-CAT transcripts, and `fantomcat.psl` with their genomic locations.
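
Generating a transcript sequence from its genomic location means concatenating the exon sequences and, for transcripts on the minus strand, taking the reverse complement. The sketch below illustrates this on a toy chromosome; the function name and the exon representation (zero-based, half-open intervals in ascending order) are assumptions made for this example.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def transcript_sequence(chromosome, exons, strand):
    """Build a spliced transcript sequence from a chromosome sequence.

    exons: list of (start, end) pairs, zero-based and half-open, sorted by start.
    strand: "+" or "-"; minus-strand transcripts are reverse-complemented.
    """
    sequence = "".join(chromosome[start:end] for start, end in exons)
    if strand == "-":
        sequence = sequence.translate(COMPLEMENT)[::-1]
    return sequence

# Toy chromosome with two exons separated by an intron.
chromosome = "AAACCCGGGTTT"
exons = [(0, 3), (6, 9)]
print(transcript_sequence(chromosome, exons, "+"))
print(transcript_sequence(chromosome, exons, "-"))
```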

Store these files:
```
mv fantomcat.fa /osc-fs_home/mdehoon/Data/CASPARs/Filters
mv fantomcat.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters
```
Create an index for BWA for the FANTOM-CAT transcript sequences:
```
bwa index /osc-fs_home/mdehoon/Data/CASPARs/Filters/fantomcat.fa
faSize -detailed /osc-fs_home/mdehoon/Data/CASPARs/Filters/fantomcat.fa > /osc-fs_home/mdehoon/Data/CASPARs/Filters/fantomcat.chrom.sizes
```

